{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sentiment Analysis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "This tutorial is available as an IPython notebook at [Malaya/example/sentiment](https://github.com/huseinzol05/Malaya/tree/master/example/sentiment).\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<div class=\"alert alert-info\">\n",
    "\n",
    "This module trained on both standard and local (included social media) language structures, so it is save to use for both.\n",
    "    \n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 2.8 s, sys: 3.91 s, total: 6.71 s\n",
      "Wall time: 1.96 s\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3397\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n",
      "/home/husein/dev/malaya/malaya/tokenizer.py:214: FutureWarning: Possible nested set at position 3927\n",
      "  self.tok = re.compile(r'({})'.format('|'.join(pipeline)))\n"
     ]
    }
   ],
   "source": [
    "%%time\n",
    "import malaya"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### labels supported\n",
    "\n",
    "Default labels for sentiment module."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['negative', 'neutral', 'positive']"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "malaya.sentiment.label"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Example texts\n",
    "\n",
    "Copy pasted from random tweets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "string1 = 'Sis, students from overseas were brought back because they are not in their countries which is if something happens to them, its not the other countries’ responsibility. Student dalam malaysia ni dah dlm tggjawab kerajaan. Mana part yg tak faham?'\n",
    "string2 = 'Harap kerajaan tak bukak serentak. Slowly release week by week. Focus on economy related industries dulu'\n",
    "string3 = 'Idk if aku salah baca ke apa. Bayaran rm350 utk golongan umur 21 ke bawah shj ? Anyone? If 21 ke atas ok lah. If umur 21 ke bawah?  Are you serious? Siapa yg lebih byk komitmen? Aku hrp aku salah baca. Aku tk jumpa artikel tu'\n",
    "string4 = 'Jabatan Penjara Malaysia diperuntukkan RM20 juta laksana program pembangunan Insan kepada banduan. Majikan yang menggaji bekas banduan, bekas penagih dadah diberi potongan cukai tambahan sehingga 2025.'\n",
    "string5 = 'Dua Hari Nyaris Hatrick, Murai Batu Ceriwis Siap Meraikan Even Bekasi Bersatu!'\n",
    "string6 = '@MasidiM Moga kerajaan sabah, tidak ikut pkp macam kerajaan pusat. Makin lama pkp, makin ramai hilang pekerjaan. Ti https://t.co/nSIABkkEDS'\n",
    "string7 = 'Hopefully esok boleh ambil gambar dengan'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load multinomial model\n",
    "\n",
    "```python\n",
    "def multinomial(**kwargs):\n",
    "    \"\"\"\n",
    "    Load multinomial emotion model.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result : malaya.model.ml.Bayes class\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "90b350beb8d6475fab7d93270582d3ab",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)main/multinomial.pkl:   0%|          | 0.00/3.36M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "a33106620ef4422c8d759b4e7cc0ee0c",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading (…)solve/main/tfidf.pkl:   0%|          | 0.00/2.72M [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "c2d3854c0ef64ab489cac079c13d6b49",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "Downloading bpe.model:   0%|          | 0.00/875k [00:00<?, ?B/s]"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/husein/.local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator ComplementNB from version 0.22.1 when using version 1.1.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n",
      "https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n",
      "  warnings.warn(\n",
      "/home/husein/.local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator TfidfTransformer from version 0.22.1 when using version 1.1.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n",
      "https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n",
      "  warnings.warn(\n",
      "/home/husein/.local/lib/python3.8/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator TfidfVectorizer from version 0.22.1 when using version 1.1.2. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:\n",
      "https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations\n",
      "  warnings.warn(\n"
     ]
    }
   ],
   "source": [
    "model = malaya.sentiment.multinomial()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predict batch of strings\n",
    "\n",
    "```python\n",
    "def predict(self, strings: List[str]):\n",
    "    \"\"\"\n",
    "    classify list of strings.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    strings: List[str]\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: List[str]\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/husein/dev/malaya/malaya/model/stem.py:28: FutureWarning: Possible nested set at position 3\n",
      "  or re.findall(_expressions['ic'], word.lower())\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['negative',\n",
       " 'negative',\n",
       " 'negative',\n",
       " 'negative',\n",
       " 'neutral',\n",
       " 'negative',\n",
       " 'positive']"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.predict([string1, string2, string3, string4, string5, string6, string7])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predict batch of strings with probability\n",
    "\n",
    "```python\n",
    "def predict_proba(self, strings: List[str]):\n",
    "    \"\"\"\n",
    "    classify list of strings and return probability.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    strings: List[str]\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: List[dict[str, float]]\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[{'negative': 0.5469396890722973,\n",
       "  'neutral': 0.13565086101545018,\n",
       "  'positive': 0.31740944991225484},\n",
       " {'negative': 0.4268497980032366,\n",
       "  'neutral': 0.21946047551031087,\n",
       "  'positive': 0.3536897264864507},\n",
       " {'negative': 0.5915531411178273,\n",
       "  'neutral': 0.1601334910709211,\n",
       "  'positive': 0.2483133678112509},\n",
       " {'negative': 0.5165487021938855,\n",
       "  'neutral': 0.13998199029917185,\n",
       "  'positive': 0.34346930750694543},\n",
       " {'negative': 0.23311742560677587,\n",
       "  'neutral': 0.4182488090323352,\n",
       "  'positive': 0.3486337653608891},\n",
       " {'negative': 0.8494818936945382,\n",
       "  'neutral': 0.060109943158198856,\n",
       "  'positive': 0.0904081631472596},\n",
       " {'negative': 0.2922247908043552,\n",
       "  'neutral': 0.3367232807540181,\n",
       "  'positive': 0.3710519284416263}]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.predict_proba([string1, string2, string3, string4, string5, string6, string7])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### List available HuggingFace models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'mesolitica/sentiment-analysis-nanot5-tiny-malaysian-cased': {'Size (MB)': 93,\n",
       "  'macro precision': 0.67768,\n",
       "  'macro recall': 0.68266,\n",
       "  'macro f1-score': 0.67997},\n",
       " 'mesolitica/sentiment-analysis-nanot5-small-malaysian-cased': {'Size (MB)': 167,\n",
       "  'macro precision': 0.67602,\n",
       "  'macro recall': 0.6712,\n",
       "  'macro f1-score': 0.67339}}"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "malaya.sentiment.available_huggingface"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Trained on https://huggingface.co/datasets/mesolitica/chatgpt-explain-sentiment\n",
      "Split 90% to train, 10% to test.\n"
     ]
    }
   ],
   "source": [
    "print(malaya.sentiment.info)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load HuggingFace model\n",
    "\n",
    "```python\n",
    "def huggingface(\n",
    "    model: str = 'mesolitica/sentiment-analysis-nanot5-small-malaysian-cased',\n",
    "    force_check: bool = True,\n",
    "    **kwargs,\n",
    "):\n",
    "    \"\"\"\n",
    "    Load HuggingFace model to classify sentiment.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    model: str, optional (default='mesolitica/sentiment-analysis-nanot5-small-malaysian-cased')\n",
    "        Check available models at `malaya.sentiment.available_huggingface`.\n",
    "    force_check: bool, optional (default=True)\n",
    "        Force check model one of malaya model.\n",
    "        Set to False if you have your own huggingface model.\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: malaya.torch_model.huggingface.Classification\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.\n"
     ]
    }
   ],
   "source": [
    "model = malaya.sentiment.huggingface()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predict batch of strings\n",
    "\n",
    "```python\n",
    "def predict(self, strings: List[str]):\n",
    "    \"\"\"\n",
    "    classify list of strings.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    strings: List[str]\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: List[str]\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1.47 s, sys: 348 ms, total: 1.82 s\n",
      "Wall time: 160 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "['negative', 'neutral', 'neutral', 'neutral', 'neutral', 'negative', 'neutral']"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "model.predict([string1, string2, string3, string4, string5, string6, string7])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Predict batch of strings with probability\n",
    "\n",
    "```python\n",
    "def predict_proba(self, strings: List[str]):\n",
    "    \"\"\"\n",
    "    classify list of strings and return probability.\n",
    "\n",
    "    Parameters\n",
    "    ----------\n",
    "    strings: List[str]\n",
    "\n",
    "    Returns\n",
    "    -------\n",
    "    result: List[dict[str, float]]\n",
    "    \"\"\"\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "CPU times: user 1.06 s, sys: 0 ns, total: 1.06 s\n",
      "Wall time: 93.1 ms\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'negative': 0.9495719075202942,\n",
       "  'neutral': 0.047029513865709305,\n",
       "  'positive': 0.0033985504414886236},\n",
       " {'negative': 0.29643991589546204,\n",
       "  'neutral': 0.4939780533313751,\n",
       "  'positive': 0.20958206057548523},\n",
       " {'negative': 0.2493346780538559,\n",
       "  'neutral': 0.7501162886619568,\n",
       "  'positive': 0.0005490880575962365},\n",
       " {'negative': 0.0963020920753479,\n",
       "  'neutral': 0.6658434271812439,\n",
       "  'positive': 0.2378544956445694},\n",
       " {'negative': 0.03835646063089371,\n",
       "  'neutral': 0.7759678363800049,\n",
       "  'positive': 0.1856757402420044},\n",
       " {'negative': 0.9871785044670105,\n",
       "  'neutral': 0.009978721849620342,\n",
       "  'positive': 0.002842693356797099},\n",
       " {'negative': 0.023206932470202446,\n",
       "  'neutral': 0.9497118592262268,\n",
       "  'positive': 0.02708127535879612}]"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "%%time\n",
    "\n",
    "model.predict_proba([string1, string2, string3, string4, string5, string6, string7])"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
